Calhoun


Institutional Archive of the Naval Postgraduate School 





Calhoun: The NPS Institutional Archive 
DSpace Repository 


Theses and Dissertations 1. Thesis and Dissertation Collection, all items 


1971-06 


Investigation of speaker identification based 
on nasal phonation. 


Young, Robert Bryant. 


Monterey, California. Naval Postgraduate School 


http://ndl.handle.net/10945/15783 


Downloaded from NPS Archive: Calhoun 


Calhoun is the Naval Postgraduate School's public access digital repository for
research materials and institutional publications created by the NPS community.
Calhoun is named for Professor of Mathematics Guy K. Calhoun, NPS's first
appointed, and published, scholarly author.


Dudley Knox Library / Naval Postgraduate School
411 Dyer Road / 1 University Circle 
Monterey, California USA 93943 





http://www.nps.edu/library 















INVESTIGATION OF SPEAKER IDENTIFICATION 
BASED ON NASAL PHONATION 


by 


Robert Bryant Young 





Thesis Advisor: J. D. Campbell


June 1971


Approved for public release; distribution unlimited.





Investigation of Speaker Identification
Based on Nasal Phonation


by


Robert Bryant Young
Lieutenant Commander, United States Navy
[degree and college illegible], 1958


Submitted in partial fulfillment of the
requirements for the degree of


MASTER OF SCIENCE IN ELECTRICAL ENGINEERING


from the


NAVAL POSTGRADUATE SCHOOL 
June 1971 





ABSTRACT 


This thesis investigates the possibility of Speaker
Identification through the use of Nasal Phonation. Short
segments of a restricted set of words from one speaker
were sampled and processed, and the resulting vector was
used to represent the speaker. Representative vectors were
formed for several speakers and correlated with vectors
representing individual words from "test" speakers. The
magnitudes of the correlations of the word vectors with
various speaker vectors were used to identify the speaker.
This work expands on earlier work done in this field to
the extent that it attempts to remove the subjective
preparation of data and replace it with an objective
process of computer mechanization. Some limited success
was achieved and, just as important, critical problem areas
are noted which, if improved upon as recommended, promise
an improved identification capability. Two different word
lists fundamental to the identification process were also
investigated. Some data was obtained but it was not
sufficient to suggest that one word list would be more
productive than the other when used as the basis for speaker
identification.

Recommendation is made to pursue further research in
Speaker Identification using computer programming
established during work on this thesis.





TABLE OF CONTENTS 


I. INTRODUCTION
II. NATURE OF THE PROBLEM
    A. SPEECH MECHANISM
    B. THEORETICAL BASIS OF EXPERIMENTS
III. EXPERIMENTAL PROCEDURE
    A. WORD LIST SELECTION AND RECORDING
    B. DIGITAL CONVERSION
    C. TAPE CONVERSION
    D. POST CONVERSION PROCESSING
IV. EXPERIMENTS
    A. EXPERIMENTAL RESULTS
V. CONCLUSIONS AND RECOMMENDATIONS
LIST OF REFERENCES
INITIAL DISTRIBUTION LIST
FORM DD 1473





LIST OF TABLES


Table
I. Word Lists
II. Results of Experiment One
III. Results of Experiment Two
IV. Results of Experiment Three
V. Results of Experiment Four
VI. Results of Experiment Five
VII. Results of Experiment Six
VIII. Results of Experiment Seven





LIST OF FIGURES


Figure
1. Schematic Diagram of the Basic Speech Process
2. Sonogram of the Word "Nominal"





ACKNOWLEDGEMENTS 


The author wishes to especially thank his Thesis
Advisor, Assistant Professor J. D. Campbell, for his
thoughtful guidance during the experiments and his
assistance in the formulation of the thesis draft. Special
thanks is also given to Assistant Professor V. M. Powers
whose interest and enthusiasm for the thesis were warmly
received.





I. INTRODUCTION


Much of what has been done in speech research in the
past has resulted directly or indirectly from the dream of
Homer Dudley back in the 1920's. It is recalled that
Dudley wanted to transmit speech over a then newly-laid
trans-Atlantic telegraph cable that had a bandwidth of
not more than 100 cycles. In the process of trying to
analyze and reduce speech to its basic elements for narrow
bandwidth transmission, much has been learned about the
speech process itself and many useful and sometimes
unexpected applications have appeared. One of these
applications is in the field of speaker identification. Speaker
identification, as the name implies, is an ability to
recognize an individual speaker solely on the basis of his
speaking traits. This goal has by no means been realized,
mainly because of the difficulty in isolating those speech
parameters which are unique to a particular speaker.

In this thesis, an attempt has been made to pursue a
very promising effort [Ref. 1] toward this goal which is
based on Nasal Phonation. The reason for this approach can
be gained by a brief examination of the human speech
mechanism as we know it today.





II. NATURE OF THE PROBLEM


A. SPEECH MECHANISM

It has been established through research by Flanagan
et al. [Ref. 2] and others that sound can be generated in
the vocal system in three ways. Voiced sounds are produced
by elevating the air pressure in the lungs, forcing a flow
through the vocal cord orifice (the glottis) and causing
the cords to vibrate. The interrupted flow produces
quasi-periodic, broad-spectrum pulses, which excite the vocal
tract. Fricative sounds are generated by forming a
constriction at some point in the tract, usually toward the
mouth end, and forcing air through the constriction at a
sufficiently high Reynolds number to produce turbulence.
A noise source of sound pressure is thereby created.
Plosive sounds result from making a complete closure, again
usually toward the front of the mouth, building up pressure
behind the closure and abruptly releasing it. All these
sources are relatively broad in spectrum. The vocal system
acts as a time-varying filter to impose its spectral
characteristics on the sources.

With reference to anatomy, the major parts of a man's
vocal apparatus are shown schematically in Figure 1. The
vocal tract proper is a non-uniform acoustic tube about 17
cm. in length. It is terminated at one end by the vocal
cords (or by the opening between them, the glottis) and at
the other end by the lips. The cross-sectional area of the
tract is determined by placement of the lips, jaws, tongue,
and velum and can vary from zero (complete closure) to
about 20 cm².


Figure 1. Schematic Diagram of the Basic Speech Process.
(Labels: brain, used to control the speech processes;
vocal cavities, including the nasal cavity, used to select
and suppress overtones; articulators, used to vary the
vocal cavities; vocal cords, used to modulate the breath
stream; lower respiratory tract (trachea, lungs,
diaphragm), used to supply the breath stream.)


An ancillary cavity, the nasal tract, can be coupled
to the vocal tract by the trapdoor action of the velum.
The nasal tract begins at the velum and terminates at the
nostrils. In man, this cavity is about 12 cm. long and has
a volume of about 60 cm³. In non-nasal sound, the velum
seals off the nasal cavity, and no sound is radiated from
the nostrils. It is the sound produced when this cavity
is coupled to the vocal tract that is of particular
interest in speaker identification based on nasal phonation.

B. THEORETICAL BASIS OF EXPERIMENTS

In the search for the parameters which might
differentiate a given speaker from another, one is motivated
by the desire to find a simple and compact solution which
might readily lend itself to practical implementation.
Therefore, one is prompted to investigate parameters which
would be consistently available from only a very small
sample of speech. A small sample can be of short time
duration. In this respect, nasal phonation is an attractive
approach because it is based on data derived from nasal
consonants which occur in certain words for short time
periods of from 35-50 milli-seconds depending on the speaker.
Analysis has shown that during their production acoustic
radiation occurs through the nostrils with the closed oral
cavity acting as a shunt. The articulators do not move
during the period of closure, and the vocal cavities remain
nearly fixed. Hence, the power spectrum of radiated
acoustic energy is essentially steady as indicated by the sound
spectrogram of the word "nominal" in Figure 2. Note that
the formants show very little movement for the duration of
the two nasals. Researchers have commented on the marked
extent to which the acoustic properties of nasal consonants
vary from speaker to speaker [References 3 and 4]. If
these indications are true over a wide population of speakers,
it is possible, then, that acoustic radiation produced
during the phonation of nasal consonants could provide a strong
clue to speaker identity.

Another important feature of nasal consonants with
respect to speaker identification is the relative frequency
with which they are used in spoken English. According to
Tobias [Ref. 5], nasals comprise 11% of the phoneme content
of commonly spoken English. Thus, there is sufficient data
available in a short sample of speech to provide some
degree of speaker identification if our hypothesis is correct.
In substantiation of this approach, Glenn and Kleiner
[Ref. 1] have reported that in experiments involving a
population of ten speakers an average identification accuracy
of 97% was obtained. With an experimental population of
thirty speakers identification accuracy was 93%.





Figure 2. Sonogram of the Word "Nominal".
(Vertical axis: frequency in KHz.)





III. EXPERIMENTAL PROCEDURE


The approach taken in this work is along the lines of
Glenn and Kleiner. However, an attempt has been made to
use the computer to a greater extent for the processing and
manipulation of data and to develop facets of the problem
not covered in their experiments.

The basic analysis technique used throughout the
experiments described in this thesis is the reduction of a
time segment (the near steady state portion during nasal
phonation) of selected words into normalized spectral
components which form a vector describing a speaker. The same
process is again performed on a different set of
words spoken by an "unknown" speaker; another vector is
formed and is compared with known speaker vectors using the
highest value of the cosine of the angle between the
compared vectors as the criterion for identification.
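
The identification criterion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis program: the function names `cosine` and `identify`, the speaker labels, and the four-component example vectors are all hypothetical (the thesis vectors have twenty-five components).

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def identify(unknown, known):
    """Return the label of the known vector with the highest cosine."""
    return max(known, key=lambda name: cosine(unknown, known[name]))

# Hypothetical example vectors, invented for illustration only.
known = {"A": [1.0, 0.2, 0.0, 0.5], "B": [0.1, 1.0, 0.8, 0.0]}
unknown = [0.9, 0.3, 0.1, 0.4]
print(identify(unknown, known))
```

Because the cosine depends only on direction, the comparison is insensitive to the overall level of the recorded speech, which is presumably one reason the criterion suits spectral vectors.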


A. WORD LIST SELECTION AND RECORDING

Various word lists were used in the different
experiments and each will be specifically discussed under the
heading of the applicable experiment. However, in each
case a word list contained twenty different words beginning
with the nasal consonant "n". A given speaker was usually
required to record twenty words in slow sequence. The
microphone used for this recording was a Shure Model 5755.
The dynamic range was sufficient for the 1 - 3.5 KHz area
of spectrum interest. The recorder used was an Ampex SP-300,
the only one conveniently available. As was later
indicated, this recorder caused some problems because of the
introduction of high background noise. Since noise bursts
could not be tolerated on the tape (because of sensitivity
of the logic sensing circuits in the digitizing process),
considerable care had to be taken during recording. In
addition to a rehearsal of the word list, each speaker was
cautioned to avoid movements of the microphone, tapping,
and other motions which might introduce such bursts. Also,
as it was only possible to use a single recorder output,
a procedure was devised whereby the VU meter could be used
to judge word placement on the tape during the digitizing
process. This procedure required the speaker to hum a tone
before and after the word list. This tone could be easily
distinguished on the VU meter and greatly facilitated the
digitizing process. Several seconds of silence were
allowed between words and after the initial tone.


B. DIGITAL CONVERSION

The recorded words were then passed through a 1 - 3.5
KHz filter [make illegible, series 5520] and then input to a CI
[model illegible] analog computer. Here, after amplification, logic
sensing circuits determined the start time of the digitizing
process, which was performed at 12.5 KHz by an SDS 9300
digital computer.

The start time of the digitizing process is of
critical concern because of the need to sample the spoken nasal
consonant during the proper interval, i.e., when the
radiated acoustic energy is most nearly in steady state. As
expected, there is a small transient period initially, then
near steady state. Fujimura's analysis [Ref. 3] has shown
varying degrees of formant variation toward the tail end of
steady state depending on the following vowel. Thus, there
appears to be an optimum steady-state window extending for
approximately ten milli-seconds during the time of the consonant.
The logic sensing arrangement allowed for a variable delay
before the start of the digitizing process after
triggering on the start of a word. Additionally, there was
provision of a delay flop to insure that once triggered,
the sensing circuit would be immune from further (false)
triggers for a predetermined amount of time (experience
has shown this time should be about 1.5 seconds).
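
The trigger, delay, and lock-out behavior described above was implemented in analog logic circuits. A software analogue might look like the following sketch; it assumes a simple amplitude threshold as the trigger condition, which the thesis does not specify, and the function name and parameters are hypothetical.

```python
def find_capture_starts(samples, rate_hz, threshold, delay_s, holdoff_s):
    """Return the sample indices at which digitizing would begin.

    A trigger fires when |sample| reaches the threshold (an assumed
    trigger condition). Capture starts `delay_s` after the trigger, and
    further triggers are ignored for `holdoff_s`, mimicking the delay flop.
    """
    delay = round(delay_s * rate_hz)
    holdoff = round(holdoff_s * rate_hz)
    starts = []
    armed_at = 0  # first index at which a new trigger is accepted
    for i, x in enumerate(samples):
        if i >= armed_at and abs(x) >= threshold:
            starts.append(i + delay)
            armed_at = i + holdoff
    return starts

# Hypothetical signal: two "words" two seconds apart at a 1 KHz rate.
signal = [0.0] * 3000
for i in range(100, 300):
    signal[i] = 1.0
for i in range(2000, 2200):
    signal[i] = 1.0
starts = find_capture_starts(signal, 1000, 0.5, 0.010, 1.5)
print(starts)
```

The hold-off keeps the remainder of a word from retriggering the capture, which is the role the text assigns to the delay flop.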

The number of samples taken was fixed for this
processing arrangement at 128 which, at a sampling rate of
12.5 KHz, takes about ten milli-seconds. A ten
milli-second delay after triggering was used to insure that the
sampling period lay within the steady state optimum
sampling window. The choice of sampling frequency also
allows a convenient feature (as will be shown later) of
having Fourier coefficients centered in approximately
100 cycle bandwidths.
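
The two figures quoted above follow directly from the sample count and the sampling rate. A short check, using only the values stated in the text (the variable names are illustrative):

```python
SAMPLE_RATE_HZ = 12_500   # sampling rate stated in the text
N_SAMPLES = 128           # samples per word block stated in the text

# Duration of one 128-sample block, in milli-seconds.
window_ms = 1000 * N_SAMPLES / SAMPLE_RATE_HZ

# Spacing between adjacent Fourier coefficients of a 128-point transform.
bin_width_hz = SAMPLE_RATE_HZ / N_SAMPLES

print(f"{window_ms:.2f} ms, {bin_width_hz:.2f} Hz")  # 10.24 ms, 97.66 Hz
```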

The digital output of the conversion for one
word list by one speaker consists of twenty blocks of 128
samples each, which are recorded on a seven-track digital
tape as a single file of data. In actual practice, the
process was repeated for each file to insure
that each word was properly sampled at least once.
Even so, files were often discarded because it
was found that they were not complete. A typical reason
for these errors was noise bursts on the voice tape. Though
attempts were made to keep this type of error to a minimum,
there were still circumstances which required manipulation
around these bursts if data was to be saved. If the burst
was distinguishable from the spoken words and occurred
prior to the twenty words, it was possible to delay
energizing the logic recognition/delay circuits until
immediately prior to the first spoken word. The timing was
physically difficult to implement in many instances;
hence the requirement to make several runs of the same
data to ensure success.


C. TAPE CONVERSION

Since the SDS 9300 did not have the digital storage
needed for analysis, use was made of the school's IBM 360.
Because the current operating system supports only
nine-track tape for FORTRAN input files, it was necessary to
convert from the seven-track tape (produced in the
digitizing process) to a nine-track tape. To facilitate
evaluation of data, a decimal print out of data by block was
also produced. Thus, for a given experiment, the entire
content of the seven-track tape was converted (including
redundant files); the decimal data was evaluated for
completeness and then, through the use of Job Control
Language (JCL), selected files were brought forward and
converted for further processing. It should be noted that
this editing process allowed mainly a quantitative check
of the data; it did not readily permit comparison of this
data with the original analog data.


D. POST CONVERSION PROCESSING

Since the energy distribution by frequency was
required, the next step in the process was to determine the
spectral content of the data blocks. The Fast Fourier
Transform Algorithm [Ref. 6] was used to compute the
complex Fourier coefficients whose magnitudes squared are the
energy spectrum. Since a 12.5 KHz sampling rate was used,
each value of the energy spectrum represented an
incremental bandwidth of 97.66 Hz (very close to the 100 Hz
bandwidths that were manually quantized by Glenn and
Kleiner). Again, closely paralleling Glenn and Kleiner's
work, an approximately 2.5 KHz band of the spectrum was
isolated by discarding those values of the energy spectrum
below 1025.36 Hz and above 3466.83 Hz. This spectrum band
of interest then included twenty-five segments of 97.66 Hz,
each of which was considered a component of a twenty-five
dimensional vector which represented a particular sampled
word. Each of the twenty word vectors then underwent a
normalization transformation according to the formula:

[formula not legible in the source]
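
The reduction of one 128-sample block to a twenty-five component word vector can be sketched as follows. This is a hedged illustration, not the thesis program: it uses a direct discrete Fourier transform in place of the Fast Fourier Transform (same coefficients, slower), and it assumes that the retained band corresponds to coefficients 11 through 35, a choice inferred from the stated band edges and the 97.66 Hz spacing. The normalization step is omitted because its formula is not recoverable here.

```python
import cmath
import math

def energy_spectrum(block):
    """Squared magnitudes of the DFT coefficients 0..N/2 of a real block."""
    n = len(block)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(block))
        spectrum.append(abs(coeff) ** 2)
    return spectrum

def word_vector(block, first_bin=11, last_bin=35):
    """Keep the 25 spectrum values in the assumed band of interest."""
    return energy_spectrum(block)[first_bin:last_bin + 1]

# Hypothetical test block: a pure tone falling exactly on coefficient 20.
block = [math.cos(2 * math.pi * 20 * t / 128) for t in range(128)]
vec = word_vector(block)
print(len(vec), vec.index(max(vec)))
```

With 128 samples at 12.5 KHz, coefficient k is centered at roughly k times 97.66 Hz, so coefficients 11 through 35 span about 1.07 to 3.42 KHz.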





Additionally, another transformation was performed on
each vector which was designed to emphasize the major pole
and the major zero of the power spectrum. For the vector

$$V' = (v'_1, v'_2, \ldots, v'_{25})$$

let

$$M = \max_i \{v'_i\}$$

represent the major pole of the spectrum, and

$$m = \min_i \{v'_i\}$$

the major zero; then the transformed vector is given by

$$V'' = (v''_1, v''_2, \ldots, v''_{25})$$

where

$$v''_i = \alpha v'_i - \beta$$

with

$$\alpha = 1/(M-m), \qquad \beta = m/(M-m).$$

The vectors thus transformed will be called subvectors. At
this point, for the purpose of the experiments, the first
ten word subvectors were considered to have derived from a
known speaker and the second ten word subvectors to have
been uttered by an unknown or "test" speaker. Each set of
ten subvectors was then arithmetically averaged by component
to form two prime vectors (representing reference and
test vectors for each speaker). For a given experiment
involving several different speakers, the cosine of each
test vector with each reference vector was computed. The
highest value in this correlation process was considered
to have matched the test vector with the speaker. If
the reference vector with highest correlation was in fact from
the same speaker as the test vector, the result was judged a "match."
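
The averaging and matching steps can be sketched as below. All names and data are hypothetical: the example uses three-component subvectors, two speakers, and two subvectors per prime vector instead of the twenty-five components and ten subvectors of the experiments.

```python
import math

def prime_vector(subvectors):
    """Component-by-component arithmetic average of a set of subvectors."""
    n = len(subvectors)
    return [sum(column) / n for column in zip(*subvectors)]

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def judge_matches(references, tests):
    """For each test speaker, find the best reference and flag a match."""
    results = {}
    for speaker, test_vec in tests.items():
        best = max(references,
                   key=lambda ref: cosine(test_vec, references[ref]))
        results[speaker] = (best, best == speaker)
    return results

# Hypothetical subvectors: first half forms the reference, second the test.
subvectors = {
    "one": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.0, 0.0, 0.1], [0.8, 0.1, 0.0]],
    "two": [[0.0, 1.0, 0.3], [0.1, 0.9, 0.2], [0.0, 0.8, 0.4], [0.2, 1.0, 0.3]],
}
references = {s: prime_vector(v[:2]) for s, v in subvectors.items()}
tests = {s: prime_vector(v[2:]) for s, v in subvectors.items()}
matches = judge_matches(references, tests)
print(matches)
```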

At each stage of the data manipulation process, program
output/input was returned to magnetic tape. However, at the
point where subvectors were generated, punched cards were
used to retain the data. The use of cards allowed
flexibility in further analysis of data.

This analysis took the form of correlation between
each subvector and the prime vector thereby generated.
Also, differences were taken between components of
subvectors and prime vectors so that frequency ranges exhibiting
the greatest component difference could be isolated.
Finally, additional programming was established to "edit"
prime vectors by reducing the number of subvectors
generating them. Of course, facility was retained to perform
correlations and component differences as was done with the
case of the prime speaker vector being generated from a full
ten subvectors.
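
The "editing" of a prime vector can be sketched as follows, assuming (as the later experiments describe) that the subvectors with the lowest correlation to the original prime vector are the ones dropped. The function names and example data are hypothetical.

```python
import math

def prime_vector(subvectors):
    """Component-by-component arithmetic average of a set of subvectors."""
    n = len(subvectors)
    return [sum(column) / n for column in zip(*subvectors)]

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def component_differences(subvector, prime):
    """Per-component difference between a subvector and a prime vector."""
    return [s - p for s, p in zip(subvector, prime)]

def edit_prime_vector(subvectors, keep):
    """Re-average using only the `keep` subvectors that correlate best
    with the original (unedited) prime vector."""
    prime = prime_vector(subvectors)
    ranked = sorted(subvectors, key=lambda s: cosine(s, prime), reverse=True)
    return prime_vector(ranked[:keep])

# Hypothetical data: the last subvector is an obvious outlier.
subs = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0]]
edited = edit_prime_vector(subs, keep=3)
print(edited)
```

In the experiments the corresponding call would keep seven of ten subvectors; the component differences help locate the frequency ranges where a subvector departs most from its prime vector.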







IV. EXPERIMENTS


With the computer programming mechanization established,
several experiments were undertaken to explore some
variations in speaker identification problem parameters.

Because of Fujimura's work [Ref. 3], which has shown
varying degrees of formant variability toward the end of the
near steady state condition depending on the following vowel,
it was decided to examine lists of words categorized by a
particular vowel following the nasal consonant "n". The
nasal consonant was chosen at the beginning of the word in
order to facilitate digitization, thereby avoiding the
problems of detection of the nasal consonant within the word
itself. Initially, three word lists were chosen for the
vowel sounds i as in beet, ɛ as in set, and æ as in sat.
Two additional word lists were added to gain further
information. These were a list containing a mixture of words
from lists one through three and finally a list containing
words with various randomly selected vowels following the
nasal consonant. These five lists are shown in Table I.
It will be noted that many of the words listed are not
meaningful words in the English language. These words
were fabricated solely for the purpose of filling each
list to twenty words. A single speaker (the author) was
used in the initial experiments.

TABLE I. Word Lists

List 1: nee, need, needle, needless, Negro, neither,
[illegible], neology, neon, neophyte, Negress, Neartic,
neef, neem, neev (remaining entries illegible)

List 2: net, neb, [illegible], nectar, negative, nekton,
nelson, nemesis, nephew, nepotism, nebula, Neptune, ness,
neck, nestle, [illegible], network, never, nest, nef

List 3: nag, nanny, nap, [illegible], napkin, narrow,
nastily, nasty, natty, natural, Navaho, navigate, Nazarene,
Nazi, nano, naphtol, naphtha, nat, national, naf

List 4: net, nest, nectar, Nelson, ness, next, nag, nanny,
nap, [illegible], napkin, narrow, nasty, never (remaining
entries illegible)

List 5: nice, nimble, Nyquist, nominal, numbering,
nonsense, [illegible], nation, Nixon, notion, nobile, new,
nanny, notice, north, nine, numeral, national, noel, notify


The goal of Experiment One was to decide on a "best"
word list to perform later identification studies with a number of
different speakers. As a criterion for judgement, the
first set of ten words in each list was considered a
reference word list and the second set of ten words a
test word list. The "best" word list was to be chosen
on the basis of maximum correlation in terms of cosine
of angles between the "reference" and "test" word lists.

In Experiment Two words from the first word list were
spoken by five different male speakers. The vectors
obtained from these words were used to form reference and
test prime vectors which were then compared in the
correlation process described previously.

Experiment Three repeated Experiment Two except that
the logic circuit delay from the start of each word was
three milli-seconds vice ten milli-seconds.

Experiment Four used all the basic data of Experiment
Two except that the ten subvectors used to form each prime
vector, for both the Reference and Test Speaker, were
"edited" to seven subvectors. The three subvectors
removed in each case were those whose correlations with the
original prime vector were smallest in value. The vectors
removed ranged in this correlation between .4443 and
[value illegible]. It was noted that this upper value along with two
other high correlation values (of six total) were removed
from speaker three's Reference and Test vectors. The
indiscriminate removal of subvectors with large correlation
values tends to degrade the data. This is further
discussed under Conclusions and Recommendations.

Experiment Five used all the basic data of Experiment
Three except that again the three lowest correlating
subvectors within each prime vector were "edited." The
subvectors removed ranged in correlation between .2564
and .7814.

Experiment Six was an attempt to identify thirteen
male speakers.

Experiment Seven used the same basic data of Experiment
Six except that the lowest three correlating subvectors
were edited from each prime vector. The subvectors
removed ranged in correlation between .2625 and .8639.


A. EXPERIMENTAL RESULTS

Table II lists the results of Experiment One. As
can be seen from the Table, the match between reference
and test vectors for the first word list is high and
possesses significant emergence. Emergence is here defined
to mean the relatively large value of correlation of a
test vector with a given reference vector as compared
to other reference vectors. This emergence quality was
not obvious in other word lists and for this reason, it
was decided to use this list in later identification
experiments.
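
The thesis defines emergence only qualitatively. One way it could be quantified, offered purely as an assumption and not as the measure used in the experiments, is the margin between the correlation with the intended reference and the largest correlation with any other reference. The function name and the correlation values below are hypothetical.

```python
def emergence(correlations, reference):
    """Margin between the correlation with `reference` and the largest
    correlation with any other reference (an assumed measure)."""
    others = [c for name, c in correlations.items() if name != reference]
    return correlations[reference] - max(others)

# Hypothetical correlations of one test vector with three references.
strong = emergence({"1": 0.95, "2": 0.40, "3": 0.45}, "1")
weak = emergence({"1": 0.95, "2": 0.93, "3": 0.90}, "1")
print(strong > weak)
```

A large positive margin corresponds to a match that stands well clear of the alternatives; a negative margin means some other reference correlated better.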

Word list three was also interesting. Firstly, the
extremely high correlation between word list three and test
word list four should be noted (this value was in fact the
highest recorded in the table). The fourth word list,
which it will be recalled was a mixture of word lists one
through three, was not mixed well because, of the last ten
words (of test list four), eight were repeated from Reference
List three. This probably explains the high
correlation value with that list. If, then, this value is
excepted, word list three correlated with test word list
three extremely well; in fact, the best of those tested
(although it does not have the nice emergence property
found with word list one). Because of this high
correlation, it was decided to use word list three in Experiments
Six and Seven.


TABLE II. Results of Experiment One
(only the reference columns legible in the source are shown)

Test               Reference Word List
Word List        1            2            3
    1       [illegible]  [illegible]    .?745
    2         .3847        .8849        .9432
    3         .4503        .8885        .9633
    4         .3978        .8989        .9661
    5         .3491        .7185        .8504

The results of Experiment Two are shown in Table III.
As can be seen, clear matches are obtained for speakers
two and four. Speakers one and five were near matches,
the correct match in each case being the second highest
correlation value. Speaker three is a poor match, placing
third in correlation but with a significant magnitude
difference from the highest correlation. In the search to
improve the number of matches, a review of the experiment
diary disclosed that speaker two (the author) and speaker
four had both spoken very slowly.

It was thought that perhaps because of the speed of the
speech the steady state window was missed. Accordingly,
the experiment was rerun using a shorter logic delay (one
milli-second).

The results of Experiment Three are given in Table IV.
This time clear matches were obtained for speakers two,
four, and five. Speaker one again placed second for
highest correlation and speaker three showed poorest placing
out of the five speakers. The reason for speaker three's
poor correlation is unknown. It was noted in the diary
that the speaker's words were close together, and it is
possible that the logic circuit did not have sufficient
opportunity to stabilize prior to a following word, thus
introducing false data into the experiment.


TABLE III. Results of Experiment Two
(correlation values, test speaker versus reference speaker,
are not legible in the source)


TABLE IV. Results of Experiment Three
(only the reference columns legible in the source are shown)

Test                Reference Speaker
Speaker          1            2            3
    1         .9164     [illegible]  [illegible]
    2       [illegible]  [illegible]    .7026
    3         .8243     [illegible]    .6324
    4       [illegible]  [illegible]    .8430
    5       [illegible]    .4826        .8894

The results of Experiment Four are given in Table V.
As can be seen, the editing process did improve the match
characteristics of the data set. However, the correlation
of test speaker five with reference five has now been
reduced to a poor third. Speaker three's correlation
values have not measurably improved. This might be
expected because of the removal of only high correlation
subvectors in the editing process.

The results of Experiment Five are shown in Table VI.
Again the overall match record is three out of five.
However, speaker one was now matched and speaker four unmatched,
although it should be noted that speaker four places second.
Speaker three is still the least likely match as was the
case in the unedited version of the experiment (Experiment
Three).


TABLE V. Results of Experiment Four
(only the reference columns legible in the source are shown)

Test                Reference Speaker
Speaker          1            2            3
    1         .9176     [illegible]    .4491
    2         .1854        .9483     [illegible]
    3         .9095     [illegible]    .7076
    4         .2586     [illegible]    .8399
    5         .8627        .3447        .8073


TABLE VI. Results of Experiment Five
(only the reference columns legible in the source are shown)

Test                Reference Speaker
Speaker          1            2            3
    1         .9064        .1647        .6169
    2       [illegible]  [illegible]    .7044
    3         .6148     [illegible]  [illegible]
    4         .3411        .6397        .8069
    5         .8691        .5194        .8263


The results of Experiment Six are given in Table VII.
There are six clear matches. Of the remaining seven, two
were runners-up, i.e., placed second, and one (speaker
twelve) placed third. The worst case was the speaker who
was mismatched by seven. It is interesting to note that
the recording diary had the comment "a little fast" for
this speaker's recitation. This means that the speaker
said the words quickly, not emphasizing the nasal consonant.
This speaker was the only speaker in this experiment for
which such a comment was made.


TABLE VII. Results of Experiment Six
(thirteen test speakers versus thirteen reference speakers;
correlation values are not legible in the source)

The results of Experiment Seven are given in Table
VIII. There is a sharp reduction in matches as a result
of the editing process; only three matches in thirteen were
obtained.

Two of the three matches were also obtained in
Experiment Six; the other match (speaker four) was a fifth place
match in Experiment Six.

The absence of the hoped-for improvement in number of
matches as a result of the editing process was
disappointing; however, it is understandable in view of the imbalance
in selection of the subvectors rejected during editing.

This will be discussed further in the following
section.


TABLE VIII. Results of Experiment Seven
(thirteen test speakers versus thirteen reference speakers;
correlation values are not legible in the source)





V. CONCLUSIONS AND RECOMMENDATIONS 


It is considered that the results of the experiments
show there is some merit to the use of nasal phonation for
speaker identification. Although the number of speakers
was limited and the absolute match percentage hovered at
about 50%, it is believed that there may be
instances of closed groups of speakers where the ability
to obtain a 50% credible match would be a useful adjunct
to other information toward identifying that speaker.
Looking at the 50% correct identification from another
point of view, it should be noted that, for example, in
Experiment Six, the data also showed that for a given match,
the correlations obtained would be useful in eliminating
about one third of the speakers from consideration in the
identification process.
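
The identification step described above can be illustrated with a minimal sketch. It assumes that the correlation is a normalized inner product and that the lowest-scoring third of the reference speakers is dropped from consideration; the function names, the three-component vectors, and the speaker labels are hypothetical and chosen only for illustration (the vectors in the experiments had twenty-five components).

```python
import math

def correlation(u, v):
    """Normalized correlation (cosine of the angle) between two vectors.
    Assumption: the thesis does not spell out its formula in this section."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_speakers(word_vector, reference_vectors):
    """Return (speaker, correlation) pairs sorted with the best match first."""
    scores = {name: correlation(word_vector, ref)
              for name, ref in reference_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def shortlist(ranking):
    """Drop the lowest-correlating third of the speakers from consideration."""
    n_drop = len(ranking) // 3
    return ranking[:len(ranking) - n_drop]

# Hypothetical reference ("prime") vectors and one test word vector.
reference_vectors = {
    "speaker_1": [1.0, 0.0, 0.0],
    "speaker_2": [0.0, 1.0, 0.0],
    "speaker_3": [0.6, 0.8, 0.0],
}
word_vector = [0.1, 0.9, 0.1]

ranking = rank_speakers(word_vector, reference_vectors)
candidates = shortlist(ranking)
print("best match:", ranking[0][0])
print("remaining candidates:", [name for name, _ in candidates])
```

Even when the top-ranked speaker is wrong, the shortlist still narrows the field, which is the sense in which a 50% match rate can be a useful adjunct to other information.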

It must be remembered that the process used in these
experiments for the actual extraction of the subvectors was
done strictly by machine, whereas Glenn and Kleiner [Ref. 1]
used manual methods. It was the intent of this investigation
to remove all subjectivity from this process because
of having experienced considerable error from previous work
with subjective sonogram comparisons.

The computer mechanization, as now established to handle
the [illegible] data, is sound; and it is believed
that, with further refinement, many inaccuracies now inherent
in the process can be removed, thereby yielding the promise
of a higher percentage of correct identifications, even
with a larger population of unidentified speakers.

The most vulnerable point in the digitizing process is
the method used to start the digitizing. As was pointed
out previously, the start time is sensitive to noise and
one must have a check to ensure that it is done properly.
Under the present scheme, reliance was placed on a delay
flip-flop to start the process. However, what was not taken
into consideration was the problem of noise in the tape
recorder. It is probable that because of this noise threshold
there were instances when the analog voice voltage was
beneath the noise for a fraction of time prior to delay
timing by the flip-flop circuit. The particular recorder
used for the experiments (as mentioned earlier) was noisy
and hence, inconsistencies in start of sampling may have
resulted.

One way to improve this situation is to use a more
noise-free recorder, not available for this project. Also,
there should be a simultaneous graphical record of the analog
voltage from the recorder along with the digital-to-analog
record of the digitized version. For convenience,
a millisecond time tick could also be plotted. This
would assist the operator on a near real time basis to ensure
a start at the desired time. Yet another
improvement would be the use of more sophisticated logic
to start the digitizing. Such logic might use some type
of nasal consonant recognizer to ensure that digitizing
started during the steady state.
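
One simple form of start logic that is less sensitive to noise is sketched below. It is not the circuit used in the experiments and not a nasal consonant recognizer; it only assumes that the signal must stay above a noise threshold for several consecutive samples before the delay is counted. The function name and the parameters `threshold`, `hold`, and `delay` are hypothetical.

```python
def find_start(samples, threshold, hold, delay):
    """Return the index at which digitizing should begin, or None.

    The onset is the first sample of a run of `hold` consecutive samples
    whose magnitude exceeds `threshold`. `delay` samples are then skipped,
    in the spirit of the delay flip-flop, to move past the initial transient.
    """
    run = 0
    for i, x in enumerate(samples):
        run = run + 1 if abs(x) > threshold else 0
        if run == hold:
            start = (i - hold + 1) + delay
            return start if start < len(samples) else None
    return None

# Hypothetical analog samples: a noise spike at index 1, voice from index 4.
samples = [0.0, 0.3, 0.0, 0.0, 0.5, 0.6, 0.7, 0.6, 0.5, 0.6, 0.5, 0.4]

print(find_start(samples, threshold=0.2, hold=1, delay=2))  # triggered by the spike
print(find_start(samples, threshold=0.2, hold=3, delay=2))  # waits for sustained signal
```

With `hold=1` the isolated spike starts the process early, which mirrors the inconsistency attributed to recorder noise; requiring a sustained run defers the start until the voice signal is established.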







A second area needing improvement is the editing process.
Even with the improvement for start of digitization,
there is still the possibility that extraneous recorded
sound energy not related to the desired signal will be
present in the subvectors. Hence, the need to edit them.
In the experiments as described, the first criterion for a
"bad" subvector was its poor correlation with the prime
vector it helped to generate. These subvectors were discarded
mainly on this basis. However, upon closer analysis
of the twenty-five component differences between the
subvectors and prime vectors, it was found that, although
poor correlations could be relied upon in most cases as an
indication for rejection, there were other cases where
irregularities in components compensated
one another and hence gave a higher correlation value,
so that the subvectors were retained. These latter subvectors
should also have been rejected, for they were probably
sampled, for one reason or another, during a non-steady-state
condition.

Another area of error overlooked in the editing process
was the decision of how many subvectors to eliminate.
Three were arbitrarily chosen for each speaker, mainly because
it was convenient for computer processing and also
because it was felt that this small number would not drastically
affect the character of each prime vector. However,
as was seen in the experiments, when this selection rule
was applied to all speaker subvectors, some high correlating
subvectors were removed and hence the data were distorted. The
discarding of subvectors should be made solely on the basis
of a small magnitude correlation value and, as mentioned
above, of component variance.
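
The recommended editing rule can be sketched as follows. The sketch assumes that the prime vector is the component-wise mean of a speaker's subvectors and that "component variance" can be approximated by the largest single-component deviation from the prime vector; both assumptions, the thresholds `min_corr` and `max_dev`, and the three-component data are hypothetical.

```python
import math

def correlation(u, v):
    """Normalized correlation between two vectors (assumed form)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def prime_vector(subvectors):
    """Component-wise mean of the subvectors (assumed definition)."""
    n = len(subvectors)
    return [sum(column) / n for column in zip(*subvectors)]

def edit_subvectors(subvectors, min_corr, max_dev):
    """Keep only subvectors that pass both tests against the prime vector.

    A subvector is rejected if its correlation with the prime vector is
    below `min_corr`, or if any single component differs from the prime
    vector by more than `max_dev`. No fixed number is rejected per speaker.
    """
    prime = prime_vector(subvectors)
    kept = []
    for sv in subvectors:
        worst = max(abs(a - p) for a, p in zip(sv, prime))
        if correlation(sv, prime) >= min_corr and worst <= max_dev:
            kept.append(sv)
    return kept

# Hypothetical subvectors for one speaker; the last one is an outlier.
subvectors = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [0.9, 1.9, 3.1],
    [3.0, 2.0, 1.0],
]
kept = edit_subvectors(subvectors, min_corr=0.9, max_dev=1.0)
print(len(kept), "of", len(subvectors), "subvectors retained")
```

Because the rule is threshold based, a speaker whose subvectors all correlate well loses none of them, which avoids the distortion caused by always discarding three.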

All speaker data was retained throughout the experiments,
and except for the editing described above, no attempt was
made to "dress" the results by discarding data which was
often misclassified. Speaker three in Experiments Two
through Five was one that might have been so eliminated.
This particular data was suspect because of his fast speech
and the consequent uncertainty regarding proper digitization.
Speaker six in Experiments Six and Seven was another example.

Another interesting fact is that words spoken by 
speaker two (the author) were always correctly identified 
(at least prior to the editing process). Though the data 
base is slim, it suggests that greater care in recording 
of words spoken by individuals taking part in the experi- 
ment would yield measurably improved results. 

The best word list to use is still open to question.
Data produced from the experiments does not favor either
of the two lists studied and in any case is not exhaustive
enough for a firm decision.

Finally, the author would like to point out that, although
it was merely a tool to an end, the computer mechanization
and debugging of the various programs used for
the experiments was formidable. It is hoped that interest
in speaker identification will continue at the Naval Postgraduate
School and that the programming established
in this thesis will be used as a stepping stone to
further research.





LIST OF REFERENCES


1. Glenn, J. W. and Kleiner, N., "Speaker Identification
Based on Nasal Phonation," The Journal of the Acoustical
Society of America, v. 43, p. 368-372, February
1968.


2. Flanagan, J. L., Coker, C. H., Rabiner, L. R., Schafer,
R. W., and Umeda, N., "Synthetic Voices for Computers,"
IEEE Spectrum, v. 7, p. 22-45, October 1970.


3. Fujimura, O., "Analysis of Nasal Consonants," The
Journal of the Acoustical Society of America, v. 34,
p. 1865-1875, December 1962.


4. Dickson, D. R., "An Acoustic Study of Nasality,"
Journal of Speech and Hearing Research, v. 5,
p. 103-111, June 1962.


5. Tobias, J. V., "Relative Occurrence of Phonemes in American
English," The Journal of the Acoustical Society of
America, v. 31, p. 631, 1959.


6. Cooley, J. W. and Tukey, J. W., "An Algorithm for the
Machine Calculation of Complex Fourier Series,"
Mathematics of Computation, v. 19, p. 297, April
1965.





INITIAL DISTRIBUTION LIST


Defense Documentation Center
Cameron Station
Alexandria, Virginia 22314


Library, Code 0212
Naval Postgraduate School
Monterey, California 93940


Professor J. Campbell, Code 52Cb
Department of Electrical Engineering
Naval Postgraduate School
Monterey, California 93940


LCDR Robert B. Young, USN
USNAV SEC GRU ACTY
FPO, New York 09513





UNCLASSIFIED
Security Classification


DOCUMENT CONTROL DATA - R&D


1. ORIGINATING ACTIVITY (Corporate author)
Naval Postgraduate School
Monterey, California 93940

2a. REPORT SECURITY CLASSIFICATION
UNCLASSIFIED

3. REPORT TITLE
INVESTIGATION OF SPEAKER IDENTIFICATION BASED ON NASAL
PHONATION

4. DESCRIPTIVE NOTES (Type of report and inclusive dates)
Master's Thesis; June 1971

5. AUTHOR(S) (First name, middle initial, last name)
Robert B. Young

6. REPORT DATE
June 1971

7a. TOTAL NO. OF PAGES
42

7b. NO. OF REFS
6

10. DISTRIBUTION STATEMENT
Approved for public release; distribution unlimited.

12. SPONSORING MILITARY ACTIVITY
Naval Postgraduate School
Monterey, California 93940

13. ABSTRACT

This thesis investigates the possibility of Speaker Identification
through the use of Nasal Phonation. Short segments of a
restricted set of words from one speaker were sampled, processed,
and the resulting vector is used to represent the speaker. Representative
vectors were formed for several speakers and correlated
with vectors representing individual words from "test"
speakers. The magnitude of the correlations of the word vectors
with various speaker vectors was used to identify the speaker.
This work expands on earlier work done in this field to the extent
that it attempts to remove the subjective preparation of
data and replace this instead with an objective process of computer
mechanization. Some limited success was achieved and, just as
important, critical problem areas are noted which, if improved
upon as recommended, promise an improved identification capability.
Two different word lists fundamental to the identification process
were also investigated. Some data was obtained but it was
not sufficient to suggest that one word list would be more productive
than the other when used as the basis for speaker identification.

Recommendation is made to pursue further research in
Speaker Identification using computer programming established during
work on this thesis.

14. KEY WORDS
Nasal Phonation
Speaker Identification